
    Bringing Hadoop into Bioinformatics with Cloudgene and CloudMan

    Despite the evident potential of the MapReduce model and the existence of bioinformatics algorithms and applications built on it, these have yet to be widely adopted in bioinformatics data analysis. The Hadoop MapReduce model offers a simple framework for data parallelism by providing automated runtime recovery (for both task runtime and hardware failures), implicit scalability (tasks automatically run in parallel batch mode), as well as data replication and locality (reducing data movement and hence increasing processing capacity). We identify two prerequisites for wider adoption and higher utilization of MapReduce tools: (1) abstract the technical details of how multiple existing MapReduce tools are composed, and (2) provide easy access to the necessary compute infrastructure and the appropriate environment. Satisfying these requirements would allow bioinformatics domain experts to focus on the analysis while the required technical details are hidden. At BOSC 2012, two platforms were presented: Cloudgene, a MapReduce tool execution platform leveraging Hadoop, and CloudMan, a cloud resource manager. Since then, we have combined and extended these two platforms to provide a readily available and accessible Hadoop-based bioinformatics environment for the cloud. Besides allowing arbitrary MapReduce tools to be integrated and used to craft an analysis, Cloudgene has been extended as a job execution engine for what are currently two dedicated services: an imputation service developed in cooperation with the Center for Statistical Genetics, University of Michigan (available at imputationserver.sph.umich.edu) and an mtDNA analysis service (available at mtdnaserver.uibk.ac.at). Thus far, the “Michigan Imputation Server” has shown remarkable popularity and scalability, with over 690,000 human genomes imputed within one year.
These services have been deployed on dedicated hardware and offer a simple interface for the specific tasks, while the jobs are executed in MapReduce fashion. This demonstrates a positive disposition towards wider adoption of the MapReduce paradigm in the bioinformatics data analysis space, given accessible and effective solutions. To facilitate easy access to such MapReduce solutions for bioinformatics and broaden the availability of these services, we have extended CloudMan to provide a Hadoop-based environment with preconfigured Cloudgene. CloudMan handles the tasks of procuring the required cloud resources and configuring the appropriate environment, thus insulating the user from the low-level technical details otherwise required. Because CloudMan is compatible with multiple cloud technologies, it is now feasible to deploy this environment on a range of private and public clouds. This makes it possible for anyone to obtain a scalable Hadoop-based cluster with Cloudgene preinstalled and readily execute MapReduce tools. This talk will present the motivation for supporting greater adoption of MapReduce-based applications in the bioinformatics data analysis space, followed by the details of the described services and their functionality.
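
For readers unfamiliar with the MapReduce model this abstract refers to, the following is a minimal in-process sketch of its three stages (map, shuffle, reduce), here counting bases across toy sequences. It is not Hadoop, Cloudgene or CloudMan code; the function names and the input reads are hypothetical and chosen only for illustration. In a real Hadoop job the framework would distribute these stages across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit one (key, 1) pair per base in a sequence record."""
    return [(base, 1) for base in record]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce sequentially in one process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Toy input, not real sequencing data.
reads = ["ACGT", "AACC", "GGTA"]
base_counts = run_job(reads)
print(base_counts)
```

Because each record is mapped independently and each key is reduced independently, both stages can run in parallel, which is the property the abstract describes as implicit scalability.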

    Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Research

    The data-driven parallelization framework Hadoop MapReduce allows large data sets to be analysed in a scalable way. Since developing MapReduce programs can be a time-intensive and challenging task, the use of Hadoop in biomedical research is still limited. Here we present Cloudflow, a high-level framework that hides the implementation details of Hadoop and provides a set of building blocks for creating biomedical pipelines in a more intuitive way. We demonstrate the benefit of Cloudflow on three different genetic use cases. We show how the framework can be combined with the Hadoop workflow system Cloudgene and the cloud orchestration platform CloudMan to provide Hadoop pipelines as a service to everyone. The framework is open source and freely available at https://github.com/genepi/cloudflow. Document type: Conference object

    CONAN: copy number variation analysis software for genome-wide association studies

    Background: Genome-wide association studies (GWAS) based on single nucleotide polymorphisms (SNPs) revolutionized our perception of the genetic regulation of complex traits and diseases. Copy number variations (CNVs) promise to shed additional light on the genetic basis of monogenic as well as complex diseases and phenotypes. Indeed, the number of detected associations between CNVs and certain phenotypes is constantly increasing. However, while several software packages support the determination of CNVs from SNP chip data, the downstream statistical inference of CNV-phenotype associations is still subject to complicated and inefficient in-house solutions, thus strongly limiting the performance of GWAS based on CNVs. Results: CONAN is a freely available client-server software solution which provides an intuitive graphical user interface for categorizing, analyzing and associating CNVs with phenotypes. Moreover, CONAN assists the evaluation process by visualizing detected associations via Manhattan plots in order to enable a rapid identification of genome-wide significant CNV regions. Various file formats including the information on CNVs in population samples are supported as input data. Conclusions: CONAN facilitates the performance of GWAS based on CNVs and the visual analysis of calculated results. CONAN provides a rapid, valid and straightforward software solution to identify genetic variation underlying the 'missing' heritability for complex traits that remains unexplained by recent GWAS. The freely available software can be downloaded at http://genepi-conan.i-med.ac.at.

    A Dawson-like clustering of human mitochondrial DNA sequences based on protein coding region

    In the present paper, our main goal is to develop fast algorithms for human mtDNA sequence analysis that require minimal and explicit assumptions about mutation models and evolutionary pathways. We propose a new approach based on a construction of Dawson, a technique based on the ordering of the variable sites. In this approach, the first step is to compute the order of the positions according to their capacity to separate the sequences into dichotomous groups. To avoid, or at least minimize, the consideration of ambiguous evolutionary events such as insertions/deletions and recurrence, which cause well-known alignment problems, in the present study we work only with the protein-coding sequence, clearly the more stable region of human mitochondrial genomes. The method was tested on a small set of 99 human mtDNA sequences comprising representatives of all major haplogroups. The approach proved to be a viable option for automating the clustering of human mtDNA sequences into broad groups, with output in agreement with the canonical classification into macro-haplogroups deposited in the Phylotree database.

    A novel but frequent variant in LPA KIV-2 is associated with a pronounced Lp(a) and cardiovascular risk reduction

    Aims: Lp(a) concentrations represent a major cardiovascular risk factor and are almost entirely controlled by a single locus (LPA). However, many genetic factors in LPA governing the enormous variance of Lp(a) levels are still unknown. Since up to 70% of the LPA coding sequence is located in a difficult-to-access hypervariable copy number variation named KIV-2, we hypothesized that it may contain novel functional variants with pronounced effects on Lp(a) concentrations. We performed a large-scale mutation analysis in the KIV-2 using an extreme phenotype approach. Methods and results: We compiled a discovery set of 123 samples showing discordance between LPA isoform phenotype and Lp(a) concentrations and controls. Using ultra-deep sequencing, we identified a splice site variant (G4925A) in preferential association with the smaller LPA isoforms. Follow-up in a European general population (n = 2892) revealed an exceptionally high carrier frequency of 22.1%. The variant explains 20.6% of the Lp(a) variance in carriers of low molecular weight (LMW) apo(a) isoforms (P = 5.75e-38) and reduces Lp(a) concentrations by 31.3 mg/dL. Accordingly, the odds ratio for cardiovascular disease was reduced from 1.39 [95% confidence interval (CI): 1.17-1.66, P = 1.89e-04] for wildtype LMW individuals to 1.19 [95% CI: 0.92-1.56, P = 0.19] in LMW individuals who were additionally positive for G4925A. Functional studies point towards a reduction of splicing efficiency by this novel variant. Conclusion: A highly frequent but until now undetected variant in the LPA KIV-2 region is strongly associated with reduced Lp(a) concentrations and reduced cardiovascular risk in LMW individuals.

    Mitochondrial DNA heteroplasmy distinguishes disease manifestation in PINK1/PRKN-linked Parkinson’s disease

    Biallelic mutations in PINK1/PRKN cause recessive Parkinson’s disease. Given the established role of PINK1/Parkin in regulating mitochondrial dynamics, we explored mitochondrial DNA (mtDNA) integrity and inflammation as disease modifiers in carriers of mutations in these genes. MtDNA integrity was investigated in a large collection of biallelic (n = 84) and monoallelic (n = 170) carriers of PINK1/PRKN mutations, idiopathic Parkinson’s disease patients (n = 67) and controls (n = 90). In addition, we studied global gene expression and serum cytokine levels in a subset. Affected and unaffected PINK1/PRKN monoallelic mutation carriers can be distinguished by heteroplasmic mtDNA variant load (AUC = 0.83, CI: 0.74-0.93). Biallelic PINK1/PRKN mutation carriers harbor more heteroplasmic mtDNA variants in blood (p = 0.0006, Z = 3.63) compared to monoallelic mutation carriers. This enrichment was confirmed in iPSC-derived (controls, n = 3; biallelic PRKN mutation carriers, n = 4) and postmortem (control, n = 1; biallelic PRKN mutation carrier, n = 1) midbrain neurons. Lastly, the heteroplasmic mtDNA variant load correlated with IL6 levels in PINK1/PRKN mutation carriers (r = 0.57, p = 0.0074). PINK1/PRKN mutations predispose individuals to mtDNA variant accumulation in a dose- and disease-dependent manner.

    Differential prognostic utility of adiposity measures in chronic kidney disease

    Objective: Adipose tissue contributes to adverse outcomes in chronic kidney disease (CKD), but there is uncertainty regarding the prognostic relevance of different adiposity measures. We analyzed the associations of neck circumference (NC), waist circumference (WC), and body mass index (BMI) with clinical outcomes in patients with mild to severe CKD. Methods: The German Chronic Kidney Disease (GCKD) study is a prospective cohort study that enrolled Caucasian adults with mild to severe CKD, defined as an estimated glomerular filtration rate (eGFR) of 30–60 mL/min/1.73 m², or >60 mL/min/1.73 m² in the presence of overt proteinuria. Associations of NC, WC and BMI with all-cause death, major cardiovascular events (MACE: a composite of non-fatal stroke, non-fatal myocardial infarction, peripheral artery disease intervention, and cardiovascular death), and kidney failure (a composite of dialysis or transplantation) were analyzed using multivariable Cox proportional hazards regression models adjusted for confounders, and the Akaike information criteria (AIC) were calculated. Models included sex interactions with adiposity measures. Results: A total of 4537 participants (59% male) were included in the analysis. During a 6.5-year follow-up, 339 participants died, 510 experienced MACE, and 341 developed kidney failure. In fully adjusted models, NC was associated with all-cause death in women (HR 1.080 per cm; 95% CI 1.009–1.155), but not in men. Irrespective of sex, WC was associated with all-cause death (HR 1.014 per cm; 95% CI 1.005–1.038). NC and WC showed no association with MACE or kidney failure. BMI was not associated with any of the analyzed outcomes. Models of all-cause death including WC offered the best (lowest) AIC. Conclusion: In Caucasian patients with mild to severe CKD, higher NC (in women) and WC were significantly associated with increased risk of death from any cause, but BMI was not.

    Benchmarking Low-Frequency Variant Calling With Long-Read Data on Mitochondrial DNA

    Background: Sequencing quality has improved over the last decade for long reads, allowing for more accurate detection of somatic low-frequency variants. In this study, we used mixtures of mitochondrial samples with different haplogroups (i.e., a specific set of mitochondrial variants) to investigate the applicability of nanopore sequencing for low-frequency single nucleotide variant detection. Methods: We investigated the impact of base-calling, alignment/mapping, quality control steps, and variant calling by comparing the results to a previously derived short-read gold standard generated on the Illumina NextSeq. For nanopore sequencing, six mixtures of four different haplotypes were prepared, allowing us to reliably check for expected variants at the predefined 5%, 2%, and 1% mixture levels. We used two different versions of Guppy for base-calling, two aligners (i.e., Minimap2 and Ngmlr), and three variant callers (i.e., Mutserve2, Freebayes, and Nanopanel2) to compare low-frequency variants. We used F1 score measurements to assess the performance of variant calling. Results: We observed a mean read length of 11 kb and a mean overall read quality of 15. Ngmlr showed not only higher F1 scores but also higher allele frequencies (AF) of false-positive calls across the mixtures (mean F1 score = 0.83; false-positive allele frequencies […] F1 score = 0.82; false-positive AF […] F1 scores (5% level: F1 score >0.99, 2% level: F1 score >0.54, and 1% level: F1 score >0.70) across all callers and mixture levels. Conclusion: We here present the benchmarking for low-frequency variant calling with nanopore sequencing by identifying current limitations.
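
The F1 score used above to assess variant calling can be illustrated with a short sketch. This uses the standard definition (the harmonic mean of precision and recall) and is not the study's own evaluation code. The function name and the variant labels are hypothetical and are not taken from the benchmark data.

```python
def f1_score(expected, called):
    """Return (precision, recall, F1) for two collections of variant labels."""
    expected, called = set(expected), set(called)
    true_pos = len(expected & called)    # expected variants that were called
    false_pos = len(called - expected)   # called variants that were not expected
    false_neg = len(expected - called)   # expected variants that were missed
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels (position plus alternate base), not study data.
expected_variants = {"73G", "263G", "750G", "1438G"}
called_variants = {"73G", "263G", "750G", "310C"}
precision, recall, f1 = f1_score(expected_variants, called_variants)
print(precision, recall, f1)
```

In this toy case three of four expected variants are recovered and one call is a false positive, so precision, recall and F1 all equal 0.75.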

    SNPflow: a lightweight application for the processing, storing and automatic quality checking of genotyping assays.

    Single nucleotide polymorphisms (SNPs) play a prominent role in modern genetics. Current genotyping technologies such as Sequenom iPLEX, ABI TaqMan and KBioscience KASPar have made the genotyping of huge SNP sets in large populations straightforward and allow the generation of hundreds of thousands of genotypes even in medium-sized labs. While data generation is straightforward, the subsequent data conversion, storage and quality control steps are time-consuming, error-prone and require extensive bioinformatic support. In order to ease this tedious process, we developed SNPflow. SNPflow is a lightweight, intuitive and easily deployable application, which processes genotype data from Sequenom MassARRAY (iPLEX) and ABI 7900HT (TaqMan, KASPar) systems and is extendible to other genotyping methods as well. SNPflow automatically converts the raw output files to ready-to-use genotype lists; calculates all standard quality control values, such as call rate, expected and actual number of replicates, minor allele frequency, absolute number of discordant replicates, discordance rate and the p-value of the HWE test; checks the plausibility of the observed genotype frequencies by comparing them to HapMap/1000-Genomes; provides a module for processing SNPs that allow sex determination for DNA quality control purposes; and, finally, stores all data in a relational database. SNPflow runs on all common operating systems and comes as both a stand-alone version and a multi-user version for laboratory-wide use. The software, a user manual, screenshots and a screencast illustrating the main features are available at http://genepi-snpflow.i-med.ac.at
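
Three of the quality control values named above (call rate, minor allele frequency and the HWE test p-value) can be sketched for a single biallelic SNP as follows. This is not SNPflow code: the function name, the genotype encoding ("AA", "AB", "BB", or None for a failed call) and the toy data are assumptions made for illustration. The abstract does not say which HWE test SNPflow uses, so a one-degree-of-freedom chi-square goodness-of-fit test is assumed here.

```python
import math


def qc_metrics(genotypes):
    """Return (call_rate, maf, hwe_p) for one biallelic SNP.

    genotypes: list of "AA", "AB", "BB", or None for a failed call
    (a hypothetical encoding chosen for this sketch).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes")
    call_rate = n / len(genotypes)

    n_aa, n_ab, n_bb = (called.count(g) for g in ("AA", "AB", "BB"))
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    freq_b = 1 - freq_a
    maf = min(freq_a, freq_b)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (n * freq_a ** 2, 2 * n * freq_a * freq_b, n * freq_b ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return call_rate, maf, hwe_p


# Toy genotypes, not real data: 100 successful calls and 25 failed calls.
toy = ["AA"] * 36 + ["AB"] * 48 + ["BB"] * 16 + [None] * 25
call_rate, maf, hwe_p = qc_metrics(toy)
print(call_rate, maf, hwe_p)
```

The toy counts were chosen to match Hardy-Weinberg proportions for an allele frequency of 0.6, so the p-value is close to 1. For small samples or rare alleles an exact test is generally preferred over the chi-square approximation.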